Introduction to program R


data science diagram

Re-thinking data


How we’ve interacted with data dictates how we structure data mentally.

Screenshot of microsoft excel window

Re-thinking data


data science diagram

Today’s goals


  • Value: Any datum (single unit of data)
  • Object: Container for holding values
  • Indexing: Querying objects by position
  • Logic: Querying objects using logical operators

Values


A value is defined by its:

  • State
  • Attributes (metadata), e.g.:
    • Name
    • Class
  • Context

Symbols representing different data values

Values


Values are distinguished by similarities and differences in their multi-dimensional states, attributes, and contexts.

Symbols representing different data values, with types of values arranged by row

Classes of values


In R, the most commonly used classes of values are:

  • Number: numeric and integer values
  • Character: strings of text (i.e., symbols)
  • Factor: categorical values, stored as integer codes that are each tied to a character level
  • Logical: the constants TRUE and FALSE, which behave as the integers 1 and 0

Classes of values: Numbers


Number values can be either:

  • Numeric: double precision floating point numbers (on the user end this can be thought of as a decimal number)
  • Integer: whole numbers
# Create a vector of numeric values:

numericV <-
  c(3, 2, 1, 1)

numericV

Classes of values: Numbers


# What type of object is this?

class(numericV)

str(numericV)

summary(numericV)

Classes of values: Numbers


# Create a vector of numeric integer values:

numericInteger <- 
  1:5

numericInteger

# What type of object is this?

class(numericInteger)

str(numericInteger)

summary(numericInteger)
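
One way to see the numeric/integer distinction is to build the same whole numbers both ways. This is a minimal sketch in base R; the object names wholeDouble and wholeInteger are illustrative and are not used elsewhere in these slides.

```r
# The L suffix asks R to store a whole number as an integer
wholeDouble <- c(1, 2, 3)
wholeInteger <- c(1L, 2L, 3L)

class(wholeDouble)    # "numeric"
class(wholeInteger)   # "integer"

# The values compare as equal even though the storage type differs
wholeDouble == wholeInteger

# The colon operator returns integers, which is why 1:5 has class integer
identical(wholeInteger, 1:3)
```

For most everyday work the difference is invisible, but it explains why class() reports different answers for c(3, 2, 1, 1) and 1:5.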

Classes of values: Character


A character or “string” value is a symbol or sequence of symbols (letters, digits, punctuation) drawn from a character set.

# Create a vector of character values:

exampleCharacter <- 
  c('three', 'two', 'one', 'one')

exampleCharacter

# What type of object is this?

class(exampleCharacter)

str(exampleCharacter)

summary(exampleCharacter)

Classes of values: Factor


A factor value includes the following information:

  • Integer value: Numeric integer value associated with factor level
  • Levels: Character values associated with integer value
  • Labels: Characters to assign to each factor level
# Create a vector of factor values:

exampleFactor <- 
  factor(
    c('three', 'two', 'one', 'one'))

exampleFactor

# What type of object is this?

class(exampleFactor)

str(exampleFactor)

summary(exampleFactor)

Classes of values: Factor


Barplot with three bars representing factor levels

Classes of values: Factor


# Set factor levels and labels:

factor(
  c('three', 'two', 'one', 'one'))

factor(
  c('three', 'two', 'one', 'one'),
  levels = c('one', 'two', 'three')
  )

Classes of values: Factor


Barplot with three bars representing factor levels

Classes of values: Factor


# Set factor levels and labels:

factor(
  c('three', 'two', 'one', 'one'))

factor(
  c('three', 'two', 'one', 'one'),
  levels = c('one', 'two', 'three'),
  labels = c('One', 'Two', 'Three')
  )
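
The integer values behind a factor can be inspected directly, which shows how the levels argument changes them. This is a small sketch in base R; the names sizes and sizesOrdered are hypothetical.

```r
# Default: levels are sorted alphabetically
sizes <- factor(c('three', 'two', 'one', 'one'))

levels(sizes)       # "one" "three" "two"
as.integer(sizes)   # 2 3 1 1

# Explicit levels: integer codes follow the order we supply
sizesOrdered <- factor(
  c('three', 'two', 'one', 'one'),
  levels = c('one', 'two', 'three'))

levels(sizesOrdered)       # "one" "two" "three"
as.integer(sizesOrdered)   # 3 2 1 1
```

The character values are identical in both factors; only the mapping between level and integer differs, and that mapping controls the order of bars in a plot.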

Classes of values: Factor


Barplot with three bars representing factor levels

Classes of values: Logical


R reserves the words TRUE and FALSE as logical constants. These constants are mapped to integer values:

  • FALSE: 0
  • TRUE: 1
# Observe the behavior of logical values:

FALSE

TRUE

as.numeric(FALSE)

as.numeric(TRUE)

FALSE + TRUE

FALSE + TRUE + TRUE

Classes of values: Logical


Logical values can be obtained by evaluating objects with logical operators. For example, the logical operator == tests whether a value is equal to another value.

# The "is equal to" logical operator:

3 == 3

3 == 4

3 == 2 + 1

3 == 3 + 1

(3 == 3) + (3 == 2 + 1)
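
Because TRUE and FALSE behave as 1 and 0, a logical test can be summed to count matches. The sketch below uses a hypothetical vector named observed.

```r
observed <- c(1, 1, 2, 3)

# The test is applied to every value in the vector
observed == 1         # TRUE TRUE FALSE FALSE

# Count how many values equal 1
sum(observed == 1)    # 2

# Proportion of values that equal 1
mean(observed == 1)   # 0.5
```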

Objects: Containers for values


In R, containers called objects structure collections of values. Different types of objects store values in different ways:


Object dimensions   Homogeneous class   Heterogeneous class
1-D                 Atomic vector       List
2-D                 Matrix              Data frame

Objects: Containers for values


For each object type, we’ll address:

  • Structure
  • Indexing
  • Attributes

Vector objects: Structure


An atomic vector is a one-dimensional collection of values. All values must be of the same class.

Each value in a vector has a position, denoted by “[x]”


[1] [2] [3] [4]
1 1 2 3

Vector objects: Structure



# A vector of numeric values:

numericVector <- 
  c(1, 1, 2, 3)

numericVector
## [1] 1 1 2 3
summary(numericVector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    1.50    1.75    2.25    3.00

Vector objects: Structure



# All values in a vector must be of the same class:

numericVector
## [1] 1 1 2 3
messyVector <- 
  c(1, 'one', 2, 3)

messyVector
## [1] "1"   "one" "2"   "3"
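
When classes are mixed, R converts every value to the most flexible class present: logical gives way to numeric, and numeric gives way to character. The following sketch illustrates this in base R; the name mixed is hypothetical.

```r
# Logical and numeric together become numeric
class(c(TRUE, 2))      # "numeric"

# Numeric and character together become character
class(c(2, 'two'))     # "character"

# Converting back to numeric leaves NA where the text is not a number
mixed <- c(1, 'one', 2, 3)
recovered <- suppressWarnings(as.numeric(mixed))

recovered              # 1 NA 2 3
```

The suppressWarnings() call only hides the warning that R raises when it introduces NA values; the conversion itself is unchanged.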

Vector objects: Indexing


Each value in a vector has a position, denoted by “[x]”

[1] [2] [3] [4]
1 1 2 3
# Use indexing to subset a vector:

numericVector

numericVector[3]

numericVector[3:4]

numericVector[c(1,3)]
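
Positions are not the only way to index. A vector can also be queried with a logical test, which ties indexing to the logical operators introduced earlier. This is a short sketch with a hypothetical vector named scores.

```r
scores <- c(1, 1, 2, 3)

# A negative position drops that value
scores[-1]                             # 1 2 3

# A logical vector keeps the positions that are TRUE
scores[c(TRUE, FALSE, TRUE, FALSE)]    # 1 2

# A logical test builds that logical vector for us
scores[scores > 1]                     # 2 3
```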

Vector objects: Attributes


Vector attributes we are typically interested in include:

  • Class: What type of values?
  • Length: How many values?
# Attributes of the vector:

class(numericVector)

length(numericVector)

str(numericVector)

Vector objects: Attributes


Attributes can be added to vectors.


# Adding attributes to a vector:

numericVector

names(numericVector)

names(numericVector) <- 
  c('orange', 'pear', 'apple', 'apple')

Vector objects: Attributes


Vectors can be indexed by their names attribute.

[‘orange’] [‘pear’] [‘apple’] [‘apple’]
1 1 2 3
numericVector[2]

numericVector['pear']

numericVector[2] == numericVector['pear']

numericVector[c('orange', 'pear')]

Matrix objects: Structure


A matrix is a two-dimensional object: essentially a vector that has been split into multiple columns. All values must be of the same class.

Values in a matrix have a row (x) and column (y) position, denoted by “[x, y]”


[ ,1] [ ,2]
[1, ] 1 2
[2, ] 1 3

Matrix objects: Structure



# Generate matrix:

m <- matrix(c(1, 1, 2, 3), ncol = 2)

m
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    3
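
The idea that a matrix is a vector arranged into columns can be checked by giving a plain vector a dim attribute. This is a sketch in base R; the name flat is hypothetical.

```r
flat <- c(1, 1, 2, 3)

# Assign two rows and two columns to the vector
dim(flat) <- c(2, 2)

# R now treats the object as a matrix
is.matrix(flat)    # TRUE

# The underlying values are unchanged: still four of them
length(flat)       # 4

flat
```

Printing flat shows the same layout as the matrix built with matrix(c(1, 1, 2, 3), ncol = 2).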

Matrix objects: Structure


A vector can be structured horizontally (row-wise) or vertically (column-wise) within a matrix:

# Compare matrices built row-wise and column-wise:

matrix(
  c(1, 1, 2, 3),
  ncol = 2, 
  byrow = TRUE)
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    3
matrix(
  c(1, 1, 2, 3),
  ncol = 2, 
  byrow = FALSE)
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    3

Matrix objects: Structure


Because matrices must be homogeneous, all values are forced to be the same type.

# Matrix built with multiple types:

messyMatrix <- 
  matrix(
    c(1, 'one', 2, 3),
    ncol = 2)

messyMatrix
##      [,1]  [,2]
## [1,] "1"   "2" 
## [2,] "one" "3"

Matrix objects: Indexing


Values in a matrix have a row (x) and column (y) position, denoted by “[x, y]”

[ ,1] [ ,2]
[1, ] 1 2
[2, ] 1 3
# Index by row (x) and column (y) position [x,y]:

m[1,1]

m[2,2]

m[1:2,2]
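
Leaving the row or column position blank returns the entire row or column. The sketch below rebuilds the same matrix under the hypothetical name mat so that it runs on its own.

```r
mat <- matrix(c(1, 1, 2, 3), ncol = 2)

# All columns of row 1
mat[1, ]                      # 1 2

# All rows of column 2
mat[, 2]                      # 2 3

# By default a single column is simplified to a vector;
# drop = FALSE keeps the result as a matrix
dim(mat[, 2, drop = FALSE])   # 2 1
```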

Matrix objects: Attributes


There are a number of attributes that can be observed for a given matrix:

# View matrix attributes:

class(m)

length(m)

dim(m)

str(m)

summary(m)

Matrix objects: Attributes


You may add a name attribute to rows and columns.

# Naming rows and columns:

colnames(m) <- 
  c('a', 'b')

rownames(m) <- 
  c('c', 'd')

attributes(m)

m

List objects: Structure


A list is a one-dimensional object constructed by combining ANY objects with ANY dimensionality.

List position is denoted by [[x]].

[[1]]

[1] [2] [3] [4]
1 1 2 3

[[2]]

[ ,1] [ ,2]
[1, ] 1 2
[2, ] 1 3

[[3]]

[ ,1] [ ,2]
[1, ] “1” “2”
[2, ] “one” “3”

List objects: Structure


# List of a numeric vector and matrices:

exampleList <- 
  list(numericVector, m, messyMatrix)

exampleList
## [[1]]
## [1] 1 1 2 3
## 
## [[2]]
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    3
## 
## [[3]]
##      [,1]  [,2]
## [1,] "1"   "2" 
## [2,] "one" "3"

List objects: Indexing


# List indexing:

exampleList
## [[1]]
## [1] 1 1 2 3
## 
## [[2]]
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    3
## 
## [[3]]
##      [,1]  [,2]
## [1,] "1"   "2" 
## [2,] "one" "3"
exampleList[[2]]
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    3

List objects: Indexing


# List indexing:

exampleList[[2]]

exampleList[[2]] == m

m[2,2]

exampleList[[2]][2,2]
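
Single brackets also work on lists, but they return a smaller list rather than the item itself. This sketch uses a hypothetical two-item list named items to show the difference.

```r
items <- list(
  c(1, 1, 2, 3),
  matrix(c(1, 1, 2, 3), ncol = 2))

# Single brackets: a list of length one that still wraps the matrix
is.list(items[2])       # TRUE
length(items[2])        # 1

# Double brackets: the matrix itself
is.matrix(items[[2]])   # TRUE
is.list(items[[2]])     # FALSE
```

This is why matrix indexing such as [2, 2] has to follow double brackets: only then is the object being indexed a matrix.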

List objects: Attributes


List attributes we are typically interested in include:

  • Class: What type of object?
  • Length: How many items?
# Attributes of a list:

class(exampleList)

length(exampleList)

str(exampleList)

List objects: Attributes


# Attributes of list items:

class(exampleList[[1]])

length(exampleList[[1]])

List objects: Attributes


Attributes can be added to lists.


# Adding attributes to a list:

exampleList

names(exampleList)

names(exampleList) <- 
  c('numericVector', 'm', 'messyMatrix')

attributes(exampleList)

List objects: Attributes

Lists can be indexed by their names attribute using double-bracket notation or the $ operator.

# Index a list by position, by name, or with the $ operator:

exampleList[[3]]

exampleList[['messyMatrix']]

exampleList$messyMatrix

Data frame objects: Structure


A data frame is a two-dimensional object constructed by combining vectors of equal length, with each vector forming a column.

Each value in a data frame has a row (x) and column (y) position, denoted by “[x, y]”


[ ,1] [ ,2]
[1, ] 1 2
[2, ] 1 3

Data frame objects: Structure


# Generate a data frame:

df <- 
  data.frame(a = c(1, 1), b =  c(2, 3))

df
##   a b
## 1 1 2
## 2 1 3

Data frame objects: Structure


The vectors that are contained in a data frame may be of different classes.

# Generate a data frame of different vector classes:

data.frame(
  a = c('one', 'one'),
  b =  c(2, 3))
##     a b
## 1 one 2
## 2 one 3

Data frame objects: Structure


But the values within each vector are still coerced into the same class!

# Attempt to generate a data frame with a mixed-class column:

messyDf <- data.frame(
  a = c(1, 'one'),
  b =  c(2, 3))

messyDf
##     a b
## 1   1 2
## 2 one 3

Data frame objects: Indexing


Values in a data frame have a row (x) and column (y) position, denoted by “[x, y]”


[ ,1] [ ,2]
[1, ] 1 2
[2, ] 1 3
# Index by row (x) and column (y) position [x,y]:

df[1,1]

df[2,2]

df[1:2,2]

Data frame objects: Attributes


There are a number of attributes that can be observed for a given data frame:

# View data frame attributes:

str(df)

class(df)

length(df)

dim(df)

summary(df)

Data frame objects: Attributes


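Always check attributes prior to working with a data frame! The output below reflects R versions before 4.0.0, in which data.frame() converted character vectors to factors by default; from R 4.0.0 onward the default is stringsAsFactors = FALSE.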

# View attributes of the messy dataframe:

str(messyDf)
## 'data.frame':    2 obs. of  2 variables:
##  $ a: Factor w/ 2 levels "1","one": 1 2
##  $ b: num  2 3
dfStrings <- data.frame(
  a = c(1, 'one'), 
  b =  c(2, 3),
  stringsAsFactors = FALSE
  )

str(dfStrings)
## 'data.frame':    2 obs. of  2 variables:
##  $ a: chr  "1" "one"
##  $ b: num  2 3
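
One reason to check for unexpected factors is that converting a factor straight to numeric returns its integer codes, not the numbers shown on screen. This is a sketch in base R with a hypothetical factor named readings.

```r
# Numbers that arrived as a factor
readings <- factor(c('10', '20', '10'))

# Direct conversion returns the level codes
as.numeric(readings)                  # 1 2 1

# Converting to character first recovers the original numbers
as.numeric(as.character(readings))    # 10 20 10
```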

Data frame objects: Attributes


Column names are set when a data frame is created. If you do not supply them, R generates names from the code used to build each column, which are rarely usable:

# Set and unset names:

data.frame(
  a = c(1, 1), 
  b =  c(2, 3))
##   a b
## 1 1 2
## 2 1 3
data.frame(
  c(1, 1),
  c(2, 3))
##   c.1..1. c.2..3.
## 1       1       2
## 2       1       3

Data frame objects: Attributes


Similar to other objects, the names attribute can also be set manually after an object is created:

# Set names after a data frame is created:

exampleDf <- 
  data.frame(
    c(1, 1),
    c(2, 3))

names(exampleDf) <- 
  c('hello', 'world')

exampleDf
##   hello world
## 1     1     2
## 2     1     3

Data frame objects: Attributes


Data frames can be indexed by their names attribute using bracket notation or the $ operator.

# Index a data frame by column name:

exampleDf['hello']


exampleDf$hello
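
The two notations return different kinds of objects, which matters when the result is passed to another function. The sketch below uses a hypothetical data frame named greetings with the same columns as above.

```r
greetings <- data.frame(hello = c(1, 1), world = c(2, 3))

# Single brackets with a name return a one-column data frame
is.data.frame(greetings['hello'])                    # TRUE

# The $ operator and double brackets return the column as a vector
is.numeric(greetings$hello)                          # TRUE
identical(greetings$hello, greetings[['hello']])     # TRUE

# A logical test on one column can select rows
greetings[greetings$world > 2, ]
```

The last line keeps only the second row, because that is the only row where world is greater than 2.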

Data frame objects: The tibble!


A tibble is a special type of data frame provided by the tibble package, which is part of the tidyverse.

# Load the tidyverse packages:

library(tidyverse)

# Generate a tibble data frame (tibble() replaces the deprecated data_frame()):

tibbleDf <- 
  tibble(
    a = c(1, 'one'),
    b =  c(2, 3))

tibbleDf
## # A tibble: 2 x 2
##   a         b
##   <chr> <dbl>
## 1 1        2.
## 2 one      3.

Data frame objects: The tibble!


Base R data frames can also be converted to a tibble.

# Convert a data frame to a tibble (as_tibble() replaces the deprecated tbl_df()):

as_tibble(messyDf)
## # A tibble: 2 x 2
##   a         b
##   <fct> <dbl>
## 1 1        2.
## 2 one      3.
as_tibble(
  data.frame(
    a = c(1, 'one'), 
    b =  c(2, 3)))
## # A tibble: 2 x 2
##   a         b
##   <fct> <dbl>
## 1 1        2.
## 2 one      3.

Data frame objects: The tibble!


How do tibbles differ from Base R data frames?

# Compare tibble and base R data frame:

data.frame(
  a = c(1, 'one'),
  b =  c(2, 3))
##     a b
## 1   1 2
## 2 one 3
tibble(
  a = c(1, 'one'),
  b =  c(2, 3))
## # A tibble: 2 x 2
##   a         b
##   <chr> <dbl>
## 1 1        2.
## 2 one      3.

Data frame objects: The tibble!


How do tibbles differ from Base R data frames?

# Load the built-in mtcars data set:

data(mtcars)

mtcars

as_tibble(mtcars)

Summary


Symbols representing different data values
  • Values:
    • Numbers
    • Characters
    • Factors
    • Logical values

  • Objects:
    • Vectors
    • Matrices
    • Lists
    • Data frames

Functions


The basic structure of a function (DO NOT RUN):

functionName <- function(function_arguments) {
    functionBody
}


Functions

A function may have one or many arguments. Typically (but, unfortunately, not always), the first argument in a function is the data the function is acting on. For example, the function mean acts on a vector of numbers.

v <- 
  c(1, 1, 2)

mean(v)
## [1] 1.333333

Functions

Subsequent arguments are often used to modify the behavior of the function. Let’s look, for example, at the behavior of the function mean with and without the argument na.rm = TRUE, which removes missing values (NA) before the mean is calculated:

v <- 
  c(1, 1, 2, NA)

mean(v)
## [1] NA
mean(v, na.rm = TRUE)
## [1] 1.333333

Functions

Almost everything we do in R requires a function. For example, reading in data:

read.csv('https://www.dropbox.com/s/rzi1ghq0bg24coh/exampleTelemetry.csv?dl=1')
##           x       y bearing
## 1 -78.16463 38.8848     358
## 2 -78.16356 38.8872     264
## 3 -78.16654 38.8882     128

Functions

The tidyverse function read_csv (from the readr package) reads the same file, but returns a tibble:

read_csv('https://www.dropbox.com/s/rzi1ghq0bg24coh/exampleTelemetry.csv?dl=1')
## Parsed with column specification:
## cols(
##   x = col_double(),
##   y = col_double(),
##   bearing = col_integer()
## )
## # A tibble: 3 x 3
##       x     y bearing
##   <dbl> <dbl>   <int>
## 1 -78.2  38.9     358
## 2 -78.2  38.9     264
## 3 -78.2  38.9     128

Writing custom functions

# First function:

addOneFun <- function(x){
    x+1
}

Writing custom functions

# Testing the function on a numeric value:

42+1
## [1] 43
addOneFun(42)
## [1] 43

Writing custom functions

# Testing the function on a vector of numeric values:

v <- c(1,1,2,3,5)

v + 1
## [1] 2 2 3 4 6
addOneFun(v)
## [1] 2 2 3 4 6
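
A custom function can also take a second argument with a default value, mirroring how na.rm modifies mean. This is a sketch only; the function addFun and its increment argument are hypothetical and not part of the original slides.

```r
# increment defaults to 1 when the caller does not supply it
addFun <- function(x, increment = 1) {
  x + increment
}

# Uses the default, so it behaves like addOneFun
addFun(42)                            # 43

# Overrides the default
addFun(42, increment = 8)             # 50

# Works on vectors as well
addFun(c(1, 1, 2), increment = 2)     # 3 3 4
```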


Exercise


The mathematical formula for standard error is provided below, where n is the number of values in x. Convert this to an R function (Note: the function for standard deviation is sd and the function for square root is sqrt): \[\mathrm{StdErr}(x) = \frac{\mathrm{StDev}(x)}{\sqrt{n}}\]